# Self-supervised learning
Vjepa2 Vitl Fpc64 256
MIT
V-JEPA 2 is a state-of-the-art video understanding model developed by Meta's FAIR team. It extends the pre-training objective of V-JEPA and achieves leading video understanding performance.
Video Processing
Transformers
facebook
109
27
Midnight
MIT
Midnight-12k is a pathology foundation model trained with self-supervised learning on a comparatively small dataset, achieving performance comparable to leading models.
Image Classification
Safetensors English
kaiko-ai
516
4
Izanami Wav2vec2 Large
Other
Japanese wav2vec 2.0 Large model pre-trained on large-scale Japanese TV broadcast audio data
Speech Recognition Japanese
imprt
89
1
Kushinada Hubert Base
Apache-2.0
Japanese speech feature extraction model pre-trained on 62,215 hours of Japanese TV broadcast audio data
Speech Recognition Japanese
imprt
1,922
1
Rnafm
RNA foundation model pre-trained on non-coding RNA data with a masked language modeling (MLM) objective
Protein Model
Safetensors Other
multimolecule
6,791
1
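A minimal embedding sketch for this entry. Both the `multimolecule` package API (`RnaTokenizer`, `RnaFmModel`) and the checkpoint ID `multimolecule/rnafm` are assumptions based on the library's documented usage pattern, not verified here:

```python
# Hypothetical sketch: embedding an ncRNA sequence with RNA-FM.
import torch
from multimolecule import RnaTokenizer, RnaFmModel  # assumed API

model_id = "multimolecule/rnafm"  # assumed checkpoint ID
tokenizer = RnaTokenizer.from_pretrained(model_id)
model = RnaFmModel.from_pretrained(model_id)

inputs = tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # one vector per sequence
print(embedding.shape)
```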
Voc2vec As Pt
Apache-2.0
voc2vec is a foundation model built on the wav2vec 2.0 framework and designed specifically for non-verbal human vocalization data.
Audio Classification
Transformers English
alkiskoudounas
31
0
Videomaev2 Base
VideoMAEv2-Base is a self-supervised video feature extraction model that employs a dual masking mechanism and is pre-trained on the UnlabeledHybrid-1M dataset.
Video Processing
OpenGVLab
3,565
5
Rnabert
RNABERT is a model pre-trained on non-coding RNA (ncRNA) with Masked Language Modeling (MLM) and Structural Alignment Learning (SAL) objectives.
Molecular Model Other
multimolecule
8,166
4
Ijepa Vitg16 22k
I-JEPA is a self-supervised learning method that predicts the representations of parts of an image from the representations of other parts, without relying on hand-crafted data augmentations or pixel-level reconstruction.
Image Classification
Transformers
facebook
14
3
Ijepa Vith16 1k
I-JEPA is a self-supervised learning method that predicts the representations of parts of an image from the representations of other parts, without relying on hand-crafted data augmentations or pixel-level reconstruction.
Image Classification
Transformers
facebook
153
0
Ijepa Vith14 22k
I-JEPA is a self-supervised learning method that predicts the representations of parts of an image from the representations of other parts, without relying on hand-crafted data augmentations or pixel-level reconstruction.
Image Classification
Transformers
facebook
48
0
Ijepa Vith14 1k
I-JEPA is a self-supervised learning method that predicts the representations of parts of an image from the representations of other parts, without relying on hand-crafted data augmentations or pixel-level reconstruction.
Image Classification
Transformers
facebook
8,239
10
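The four I-JEPA entries above are plain vision encoders without classification heads, so feature extraction is the natural usage. A minimal sketch with the Transformers Auto classes, assuming a release that includes I-JEPA support and that the hub ID follows the `facebook/ijepa_vith14_1k` pattern; `example.jpg` is a placeholder:

```python
# Sketch: global image embeddings from an I-JEPA encoder.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/ijepa_vith14_1k"  # assumed hub ID
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")  # placeholder input
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# I-JEPA has no [CLS] token, so mean-pool the patch tokens for a global embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```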
Dinov2 Large Patch14
Apache-2.0
DINOv2 large is a large-scale visual feature extraction model based on self-supervised learning, capable of generating robust image feature representations.
refiners
20
0
Rad Dino
Other
Vision Transformer model trained with self-supervised DINOv2, specifically designed for encoding chest X-ray images
Image Classification
Transformers
microsoft
411.96k
48
Ahma 7B
Apache-2.0
Ahma-7B is a 7-billion-parameter decoder-only Transformer based on Meta's Llama (v1) architecture, pretrained entirely from scratch on Finnish text.
Large Language Model
Transformers Other
Finnish-NLP
201
8
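Since Ahma-7B is a standard Llama-style decoder-only model, it should load through the usual causal-LM classes. A sketch assuming the hub ID is `Finnish-NLP/Ahma-7B`; the prompt is an arbitrary Finnish placeholder:

```python
# Sketch: Finnish text generation with a causal LM checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Finnish-NLP/Ahma-7B"  # assumed hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Suomen kieli on", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```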
Vit Small Patch8 224.lunit Dino
Other
An image classification model based on the Vision Transformer (ViT), trained on 33 million histology image patches with the DINO self-supervised learning method, suitable for pathology image classification tasks.
Image Classification
1aurent
167
1
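This checkpoint is distributed as a timm backbone, so timm's Hugging Face hub integration is the most direct loading path. A sketch assuming the hub ID is `1aurent/vit_small_patch8_224.lunit_dino`; the random tensor stands in for a preprocessed 224x224 tile:

```python
# Sketch: histology tile features from a DINO-pretrained ViT-S/8 via timm.
import timm
import torch

model = timm.create_model(
    "hf-hub:1aurent/vit_small_patch8_224.lunit_dino",  # assumed hub ID
    pretrained=True,
    num_classes=0,  # strip the head, keep the feature extractor
)
model.eval()

x = torch.randn(1, 3, 224, 224)  # placeholder for a normalized 224x224 tile
with torch.no_grad():
    features = model(x)
print(features.shape)
```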
Phikon
Other
Phikon is a self-supervised learning model for histopathology based on iBOT training, primarily used for extracting features from histology image patches.
Image Classification
Transformers English
owkin
741.63k
30
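Phikon is shipped as a ViT backbone, and a common usage pattern is to take the [CLS] token of the last hidden state as the tile-level feature. A sketch assuming the hub ID is `owkin/phikon`; `tile.png` is a placeholder:

```python
# Sketch: extracting a feature vector from a histology tile with Phikon.
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

model_id = "owkin/phikon"  # assumed hub ID
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTModel.from_pretrained(model_id, add_pooling_layer=False)

tile = Image.open("tile.png").convert("RGB")  # placeholder input
inputs = processor(images=tile, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state[:, 0, :]  # [CLS] token as the tile embedding
print(features.shape)
```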
Hubert Base Audioset
Audio representation model based on HuBERT architecture, pre-trained on the complete AudioSet dataset, suitable for general audio tasks
Audio Classification
Transformers
ALM
345
2
Pubchemdeberta Augmented
TwinBooster is a DeBERTa V3 base model fine-tuned on the PubChem bioassay corpus, combining the Barlow Twins self-supervised learning method with gradient boosting to enhance molecular property prediction.
Molecular Model
Transformers English
mschuh
25
0
Japanese Hubert Base
Apache-2.0
Japanese HuBERT base model trained by rinna Co., Ltd. on approximately 19,000 hours of the Japanese speech corpus ReazonSpeech v1.
Speech Recognition
Transformers Japanese
rinna
4,550
68
Data2vec Vision Base Ft1k
Apache-2.0
Data2Vec-Vision is a self-supervised learning model based on the BEiT architecture, fine-tuned on the ImageNet-1k dataset, suitable for image classification tasks.
Image Classification
Transformers
facebook
7,520
2
Data2vec Vision Large Ft1k
Apache-2.0
Data2Vec-Vision is a self-supervised learning vision model based on the BEiT architecture, fine-tuned on the ImageNet-1k dataset, suitable for image classification tasks.
Image Classification
Transformers
facebook
68
5
Data2vec Vision Large
Apache-2.0
Data2Vec-Vision is a self-supervised learning model based on the BEiT architecture, pre-trained on the ImageNet-1k dataset, suitable for image classification tasks.
Image Classification
Transformers
facebook
225
2
Data2vec Vision Base
Apache-2.0
Data2Vec-Vision is a self-supervised learning model based on the BEiT architecture, pretrained on the ImageNet-1k dataset, suitable for image classification tasks.
Image Classification
Transformers
facebook
427
3
Wav2vec2 Large 10min Lv60 Self
Apache-2.0
A large speech recognition model based on the Wav2Vec2 architecture, pre-trained on Libri-Light (LV-60k) and fine-tuned on 10 minutes of labeled data with a self-training objective; suitable for 16kHz sampled speech audio.
Speech Recognition
Transformers English
Splend1dchan
177
0
Data2vec Audio Large 960h
Apache-2.0
Data2Vec is a general self-supervised learning framework applicable to speech, vision, and language tasks. This large audio model is pre-trained and fine-tuned on 960 hours of LibriSpeech data, specifically optimized for automatic speech recognition tasks.
Speech Recognition
Transformers English
facebook
2,531
7
Data2vec Audio Large
Apache-2.0
Data2Vec-Audio-Large is a large model pre-trained on 16kHz sampled speech audio using a self-supervised learning framework, suitable for tasks such as speech recognition.
Speech Recognition
Transformers English
facebook
97
1
Data2vec Text Base
MIT
A language model pre-trained on English text with the data2vec objective, a general self-supervised learning framework that handles different modalities through a unified approach.
Large Language Model
Transformers English
facebook
1,796
12
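The text checkpoint is a pre-trained encoder meant for fine-tuning or feature extraction rather than direct generation. A sketch pulling sentence features with the generic Auto classes, assuming the hub ID is `facebook/data2vec-text-base`:

```python
# Sketch: sentence embeddings from the pre-trained Data2Vec text encoder.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "facebook/data2vec-text-base"  # assumed hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Self-supervised learning scales well.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)  # mean-pooled tokens
print(sentence_embedding.shape)
```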
Large
Apache-2.0
A Transformer model pre-trained on an English corpus with an ELECTRA-like objective, learning intrinsic representations of the English language through self-supervision.
Large Language Model
Transformers English
funnel-transformer
190
2
Dino Vits16
Apache-2.0
A self-supervised Vision Transformer model trained using the DINO method, suitable for image feature extraction
Image Classification
Transformers
facebook
47.32k
16
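DINO ViT-S/16 is a plain ViT encoder, so it is used for feature extraction rather than classification. A sketch assuming the hub ID is `facebook/dino-vits16`; `example.jpg` is a placeholder:

```python
# Sketch: global image features from the DINO ViT-S/16 encoder.
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

model_id = "facebook/dino-vits16"  # assumed hub ID
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTModel.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")  # placeholder input
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token
print(cls_embedding.shape)
```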
Wav2vec2 Spanish
Pre-trained speech recognition model based on Common Voice Spanish data, trained on TPU using the Flax framework
Speech Recognition Spanish
flax-community
16
2
Albert Fa Base V2 Clf Digimag
Apache-2.0
The first lightweight ALBERT model for the Persian language, based on Google's ALBERT BASE v2.0.
Large Language Model
Transformers Other
m3hrdadfi
14
0
Wav2vec2 Large Es Voxpopuli
Large-scale speech pre-training model trained on the Spanish subset of the VoxPopuli corpus, suitable for Spanish speech recognition tasks
Speech Recognition Spanish
facebook
117.04k
1
Xlarge
Apache-2.0
Funnel Transformer is an English pre-trained model based on self-supervised learning with an ELECTRA-like objective, achieving efficient language processing by filtering out sequential redundancy.
Large Language Model
Transformers English
funnel-transformer
31
1
Tf Xlm Roberta Base
XLM-RoBERTa is a cross-lingual sentence encoder trained on 2.5TB of data covering 100 languages, achieving excellent performance on multiple cross-lingual benchmarks.
Large Language Model
Transformers
jplu
4,820
1
Hubert Large Ls960 Ft
Apache-2.0
HuBERT-Large is a self-supervised speech representation learning model fine-tuned on 960 hours of LibriSpeech data for automatic speech recognition tasks.
Speech Recognition
Transformers English
facebook
776.27k
66
Tf Xlm Roberta Large
XLM-RoBERTa is a large-scale cross-lingual sentence encoder, trained on 2.5TB of data across 100 languages, achieving excellent performance in multiple cross-lingual benchmarks.
Large Language Model
Transformers
jplu
236
1
Core Clinical Diagnosis Prediction
The CORe model is based on BioBERT and further pre-trained on clinical data with a clinical outcome pre-training objective; it predicts ICD9 diagnosis codes from admission notes.
Text Classification
Transformers English
DATEXIS
789
32
Papugapt2
A Polish text generation model based on the GPT-2 architecture, filling a gap in Polish NLP; trained on the multilingual OSCAR corpus.
Large Language Model Other
flax-community
804
11
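As a GPT-2-style causal LM, the model works with the standard text-generation pipeline. A sketch assuming the hub ID is `flax-community/papuGaPT2` and that PyTorch weights are available in the repo:

```python
# Sketch: Polish text generation with a GPT-2-style checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="flax-community/papuGaPT2")  # assumed hub ID
result = generator("Najlepszym polskim miastem jest", max_new_tokens=30, do_sample=True)
print(result[0]["generated_text"])
```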
S2t Wav2vec2 Large En De
MIT
A Transformer-based end-to-end model designed for English-to-German speech translation.
Speech Recognition
Transformers Supports Multiple Languages
facebook
817
4
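The checkpoint pairs a wav2vec 2.0 speech encoder with an autoregressive text decoder, so the ASR pipeline returns German text for English audio. A sketch assuming the hub ID is `facebook/s2t-wav2vec2-large-en-de`; the audio path is a placeholder:

```python
# Sketch: English speech in, German text out, via the ASR pipeline.
from transformers import pipeline

translator = pipeline(
    "automatic-speech-recognition",
    model="facebook/s2t-wav2vec2-large-en-de",              # assumed hub ID
    feature_extractor="facebook/s2t-wav2vec2-large-en-de",  # same repo for the feature extractor
)
print(translator("english_speech_16khz.wav"))  # placeholder audio file
```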